This is an exploratory data analysis of a red wine data set. This tidy data set contains 1,599 red wines with 11 variables on the chemical properties of the wine. At least 3 wine experts rated the quality of each wine, providing a rating between 0 (very bad) and 10 (very excellent). The data set is available for download, and an accompanying text file explaining the data can be useful.

## Observations: 1,599
## Variables: 13
## $ X                    <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13...
## $ fixed.acidity        <dbl> 7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, ...
## $ volatile.acidity     <dbl> 0.700, 0.880, 0.760, 0.280, 0.700, 0.660,...
## $ citric.acid          <dbl> 0.00, 0.00, 0.04, 0.56, 0.00, 0.00, 0.06,...
## $ residual.sugar       <dbl> 1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2...
## $ chlorides            <dbl> 0.076, 0.098, 0.092, 0.075, 0.076, 0.075,...
## $ free.sulfur.dioxide  <dbl> 11, 25, 15, 17, 11, 13, 15, 15, 9, 17, 15...
## $ total.sulfur.dioxide <dbl> 34, 67, 54, 60, 34, 40, 59, 21, 18, 102, ...
## $ density              <dbl> 0.9978, 0.9968, 0.9970, 0.9980, 0.9978, 0...
## $ pH                   <dbl> 3.51, 3.20, 3.26, 3.16, 3.51, 3.51, 3.30,...
## $ sulphates            <dbl> 0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46,...
## $ alcohol              <dbl> 9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0, ...
## $ quality              <int> 5, 5, 5, 6, 5, 5, 5, 7, 7, 5, 5, 5, 5, 5,...
##                variable    mean std_dev variation_coef   p_01   p_05
## 1                     X 800.000 4.6e+02         0.5772 16.980 80.900
## 2         fixed.acidity   8.320 1.7e+00         0.2093  5.200  6.100
## 3      volatile.acidity   0.528 1.8e-01         0.3392  0.190  0.270
## 4           citric.acid   0.271 1.9e-01         0.7189  0.000  0.000
## 5        residual.sugar   2.539 1.4e+00         0.5554  1.400  1.590
## 6             chlorides   0.087 4.7e-02         0.5381  0.043  0.054
## 7   free.sulfur.dioxide  15.875 1.0e+01         0.6589  3.000  4.000
## 8  total.sulfur.dioxide  46.468 3.3e+01         0.7079  8.000 11.000
## 9               density   0.997 1.9e-03         0.0019  0.992  0.994
## 10                   pH   3.311 1.5e-01         0.0466  2.930  3.060
## 11            sulphates   0.658 1.7e-01         0.2576  0.420  0.470
## 12              alcohol  10.423 1.1e+00         0.1022  9.000  9.200
## 13              quality   5.636 8.1e-01         0.1433  4.000  5.000
##      p_25    p_50    p_75    p_95    p_99 skewness kurtosis     iqr
## 1  400.50 800.000 1199.50 1519.10 1583.02    0.000      1.8 8.0e+02
## 2    7.10   7.900    9.20   11.80   13.30    0.982      4.1 2.1e+00
## 3    0.39   0.520    0.64    0.84    1.02    0.671      4.2 2.5e-01
## 4    0.09   0.260    0.42    0.60    0.70    0.318      2.2 3.3e-01
## 5    1.90   2.200    2.60    5.10    8.31    4.536     31.5 7.0e-01
## 6    0.07   0.079    0.09    0.13    0.36    5.675     44.6 2.0e-02
## 7    7.00  14.000   21.00   35.00   50.02    1.249      5.0 1.4e+01
## 8   22.00  38.000   62.00  112.10  145.00    1.514      6.8 4.0e+01
## 9    1.00   0.997    1.00    1.00    1.00    0.071      3.9 2.2e-03
## 10   3.21   3.310    3.40    3.57    3.70    0.194      3.8 1.9e-01
## 11   0.55   0.620    0.73    0.93    1.26    2.426     14.7 1.8e-01
## 12   9.50  10.200   11.10   12.50   13.40    0.860      3.2 1.6e+00
## 13   5.00   6.000    6.00    7.00    8.00    0.218      3.3 1.0e+00
##            range_98        range_80
## 1  [16.98, 1583.02] [160.8, 1439.2]
## 2       [5.2, 13.3]     [6.5, 10.7]
## 3      [0.19, 1.02]    [0.31, 0.74]
## 4          [0, 0.7]    [0.01, 0.52]
## 5       [1.4, 8.31]      [1.7, 3.6]
## 6      [0.04, 0.36]    [0.06, 0.11]
## 7        [3, 50.02]         [5, 31]
## 8          [8, 145]      [14, 93.2]
## 9         [0.99, 1]       [0.99, 1]
## 10      [2.93, 3.7]    [3.12, 3.51]
## 11     [0.42, 1.26]     [0.5, 0.85]
## 12        [9, 13.4]       [9.3, 12]
## 13           [4, 8]          [5, 7]
variable mean std_dev variation_coef p_01 p_05 p_25 p_50 p_75 p_95 p_99 skewness kurtosis iqr range_98 range_80
X 800.00 461.74 0.58 16.98 80.90 400.50 800.00 1199.50 1519.10 1583.02 0.00 1.8 799.00 [16.98, 1583.02] [160.8, 1439.2]
fixed.acidity 8.32 1.74 0.21 5.20 6.10 7.10 7.90 9.20 11.80 13.30 0.98 4.1 2.10 [5.2, 13.3] [6.5, 10.7]
volatile.acidity 0.53 0.18 0.34 0.19 0.27 0.39 0.52 0.64 0.84 1.02 0.67 4.2 0.25 [0.19, 1.02] [0.31, 0.74]
citric.acid 0.27 0.19 0.72 0.00 0.00 0.09 0.26 0.42 0.60 0.70 0.32 2.2 0.33 [0, 0.7] [0.01, 0.52]
residual.sugar 2.54 1.41 0.56 1.40 1.59 1.90 2.20 2.60 5.10 8.31 4.54 31.5 0.70 [1.4, 8.31] [1.7, 3.6]
chlorides 0.09 0.05 0.54 0.04 0.05 0.07 0.08 0.09 0.13 0.36 5.68 44.6 0.02 [0.04, 0.36] [0.06, 0.11]
free.sulfur.dioxide 15.87 10.46 0.66 3.00 4.00 7.00 14.00 21.00 35.00 50.02 1.25 5.0 14.00 [3, 50.02] [5, 31]
total.sulfur.dioxide 46.47 32.90 0.71 8.00 11.00 22.00 38.00 62.00 112.10 145.00 1.51 6.8 40.00 [8, 145] [14, 93.2]
density 1.00 0.00 0.00 0.99 0.99 1.00 1.00 1.00 1.00 1.00 0.07 3.9 0.00 [0.99, 1] [0.99, 1]
pH 3.31 0.15 0.05 2.93 3.06 3.21 3.31 3.40 3.57 3.70 0.19 3.8 0.19 [2.93, 3.7] [3.12, 3.51]
sulphates 0.66 0.17 0.26 0.42 0.47 0.55 0.62 0.73 0.93 1.26 2.43 14.7 0.18 [0.42, 1.26] [0.5, 0.85]
alcohol 10.42 1.07 0.10 9.00 9.20 9.50 10.20 11.10 12.50 13.40 0.86 3.2 1.60 [9, 13.4] [9.3, 12]
quality 5.64 0.81 0.14 4.00 5.00 5.00 6.00 6.00 7.00 8.00 0.22 3.3 1.00 [4, 8] [5, 7]

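From the output above, we know there are 1,599 observations (rows) and 13 variables (columns): the row index X, the 11 chemical properties, and the quality rating. The table also gives summary statistics for each variable if needed. Also, the plot shows that the majority of the wines tested have low residual sugar and low chlorides, which matches the strong right skew of both variables in the table (skewness of about 4.5 and 5.7).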
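The columns of the summary table can be recomputed by hand, which helps when reading it. The following is a minimal base R sketch, not the code that produced the table. It assumes that variation_coef is the standard deviation divided by the mean, that iqr is p_75 minus p_25, that range_98 runs from p_01 to p_99, and that range_80 runs from the 10th to the 90th percentile. These definitions agree with the numbers shown above but are not stated in the source. The helper name `profile_one` is hypothetical, and the input is only the first eight alcohol values from the column listing, so the results will not match the full-data row.

```r
# Hypothetical helper: recompute a few summary-table columns for one numeric vector.
profile_one <- function(x) {
  # Percentiles using R's default quantile definition (type 7)
  q <- quantile(x, probs = c(0.01, 0.10, 0.25, 0.50, 0.75, 0.90, 0.99), names = FALSE)
  list(
    mean           = mean(x),
    std_dev        = sd(x),
    variation_coef = sd(x) / mean(x),  # assumed definition: relative spread
    p_50           = q[4],
    iqr            = q[5] - q[3],      # p_75 - p_25
    range_98       = c(q[1], q[7]),    # assumed: p_01 to p_99
    range_80       = c(q[2], q[6])     # assumed: p_10 to p_90
  )
}

# First eight alcohol values shown in the column listing (not the full data set)
alcohol_head <- c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0)
alcohol_profile <- profile_one(alcohol_head)
str(alcohol_profile)
```

A variation coefficient near zero (as for density and pH) means the variable barely moves relative to its mean, while values above 0.5 (citric acid, total sulfur dioxide) indicate a wide relative spread.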




## How Many Wines in Each Rating Group?
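Counting the wines in each rating group is a simple tabulation. Here is a minimal sketch in base R, using only the first 14 quality values from the column listing as stand-in data; on the full data frame the same calls would be applied to the quality column.

```r
# First 14 quality ratings shown in the column listing (not the full data set)
quality_head <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5, 5, 5, 5, 5)

# Count wines per rating, then convert the counts to shares
rating_counts <- table(quality_head)
rating_share  <- prop.table(rating_counts)

print(rating_counts)
print(round(rating_share, 2))
```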

Which Wines Have a Rating of 8? (Highest Rating in the Data Frame)

fixed.acidity volatile.acidity citric.acid residual.sugar chlorides free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol quality
7.9 0.35 0.46 3.6 0.078 15 37 0.9973 3.35 0.86 12.8 8
10.3 0.32 0.45 6.4 0.073 5 13 0.9976 3.23 0.82 12.6 8
5.6 0.85 0.05 1.4 0.045 12 88 0.9924 3.56 0.82 12.9 8
12.6 0.31 0.72 2.2 0.072 6 29 0.9987 2.88 0.82 9.8 8
11.3 0.62 0.67 5.2 0.086 6 19 0.9988 3.22 0.69 13.4 8
9.4 0.3 0.56 2.8 0.08 6 17 0.9964 3.15 0.92 11.7 8
10.7 0.35 0.53 2.6 0.07 5 16 0.9972 3.15 0.65 11 8
10.7 0.35 0.53 2.6 0.07 5 16 0.9972 3.15 0.65 11 8
5 0.42 0.24 2 0.06 19 50 0.9917 3.72 0.74 14 8
7.8 0.57 0.09 2.3 0.065 34 45 0.99417 3.46 0.74 12.7 8
9.1 0.4 0.5 1.8 0.071 7 16 0.99462 3.21 0.69 12.5 8
10 0.26 0.54 1.9 0.083 42 74 0.99451 2.98 0.63 11.8 8
7.9 0.54 0.34 2.5 0.076 8 17 0.99235 3.2 0.72 13.1 8
8.6 0.42 0.39 1.8 0.068 6 12 0.99516 3.35 0.69 11.7 8
5.5 0.49 0.03 1.8 0.044 28 87 0.9908 3.5 0.82 14 8
7.2 0.33 0.33 1.7 0.061 3 13 0.996 3.23 1.1 10 8
7.2 0.38 0.31 2 0.056 15 29 0.99472 3.23 0.76 11.3 8
7.4 0.36 0.3 1.8 0.074 17 24 0.99419 3.24 0.7 11.4 8




Which Wines Have a Rating of 3? (Lowest Rating in the Data Frame)

X fixed.acidity volatile.acidity citric.acid residual.sugar chlorides free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol quality
460 11.6 0.58 0.66 2.2 0.074 10 47 1.0008 3.25 0.57 9 3
518 10.4 0.61 0.49 2.1 0.2 5 16 0.9994 3.16 0.63 8.4 3
691 7.4 1.185 0 4.25 0.097 5 14 0.9966 3.63 0.54 10.7 3
833 10.4 0.44 0.42 1.5 0.145 34 48 0.99832 3.38 0.86 9.9 3
900 8.3 1.02 0.02 3.4 0.084 6 11 0.99892 3.48 0.49 11 3
1300 7.6 1.58 0 2.1 0.137 5 9 0.99476 3.5 0.4 10.9 3
1375 6.8 0.815 0 1.2 0.267 16 29 0.99471 3.32 0.51 9.8 3
1470 7.3 0.98 0.05 2.1 0.061 20 49 0.99705 3.31 0.55 9.7 3
1479 7.1 0.875 0.05 5.7 0.082 3 14 0.99808 3.4 0.52 10.2 3
1506 6.7 0.76 0.02 1.8 0.078 6 12 0.996 3.55 0.63 9.95 3
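Tables like the two above come from filtering the data frame on the quality column. A minimal sketch, assuming a data frame named `wines` (a hypothetical name) that here holds only six rows and four columns copied from the tables above:

```r
# Six rows copied from the tables above (three rated 8, three rated 3)
wines <- data.frame(
  fixed.acidity    = c(7.9, 10.3, 5.6, 11.6, 10.4, 7.4),
  volatile.acidity = c(0.35, 0.32, 0.85, 0.58, 0.61, 1.185),
  alcohol          = c(12.8, 12.6, 12.9, 9.0, 8.4, 10.7),
  quality          = c(8L, 8L, 8L, 3L, 3L, 3L)
)

# Rows at the highest and lowest rating present in the data frame
best  <- wines[wines$quality == max(wines$quality), ]
worst <- wines[wines$quality == min(wines$quality), ]

print(best)
print(worst)
```

Filtering on `max()` and `min()` rather than on the literal values 8 and 3 keeps the code valid if the data set is later extended with other ratings.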




Is There Any Correlation Between Any of the Variables and the Wine Rating?





From the correlation matrix below, we can see that the two variables with the strongest correlation (in absolute value) with quality are alcohol level and volatile acidity.
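Ranking the variables by their correlation with quality can be sketched as follows. This is an illustration under stated assumptions, not the original analysis: the data frame `wines` (a hypothetical name) holds only three chemical properties for the first ten wines rated 8 and the ten wines rated 3 listed above. Because the sample contains only the two extreme ratings, the magnitudes will be larger than on the full data set. Only the signs and the method carry over.

```r
# First ten wines rated 8 followed by the ten wines rated 3 (values from the tables above)
wines <- data.frame(
  volatile.acidity = c(0.35, 0.32, 0.85, 0.31, 0.62, 0.30, 0.35, 0.35, 0.42, 0.57,
                       0.58, 0.61, 1.185, 0.44, 1.02, 1.58, 0.815, 0.98, 0.875, 0.76),
  sulphates        = c(0.86, 0.82, 0.82, 0.82, 0.69, 0.92, 0.65, 0.65, 0.74, 0.74,
                       0.57, 0.63, 0.54, 0.86, 0.49, 0.40, 0.51, 0.55, 0.52, 0.63),
  alcohol          = c(12.8, 12.6, 12.9, 9.8, 13.4, 11.7, 11.0, 11.0, 14.0, 12.7,
                       9.0, 8.4, 10.7, 9.9, 11.0, 10.9, 9.8, 9.7, 10.2, 9.95),
  quality          = rep(c(8, 3), each = 10)
)

# Pearson correlation of each chemical property with quality
predictors <- setdiff(names(wines), "quality")
cor_with_quality <- sapply(wines[predictors], function(col) cor(col, wines$quality))

# Sort by absolute strength, strongest first
cor_with_quality <- cor_with_quality[order(abs(cor_with_quality), decreasing = TRUE)]
print(round(cor_with_quality, 2))
```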

Is There Any Strong Relationship (Correlation) Between One Chemical Property and Another?

To answer this question and investigate further, I am going to select the pairs of variables whose correlation is above 0.60 or below -0.60.
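One way to pick out such pairs is to scan the upper triangle of the correlation matrix. The sketch below is an assumption about how this could be done, not the original code. The data frame `chem` (a hypothetical name) holds four properties for the first ten wines in the rating-8 table, so the exact coefficients differ from those of the full data set.

```r
# Four chemical properties for the first ten wines in the rating-8 table above
chem <- data.frame(
  fixed.acidity = c(7.9, 10.3, 5.6, 12.6, 11.3, 9.4, 10.7, 10.7, 5.0, 7.8),
  citric.acid   = c(0.46, 0.45, 0.05, 0.72, 0.67, 0.56, 0.53, 0.53, 0.24, 0.09),
  density       = c(0.9973, 0.9976, 0.9924, 0.9987, 0.9988, 0.9964, 0.9972, 0.9972,
                    0.9917, 0.99417),
  pH            = c(3.35, 3.23, 3.56, 2.88, 3.22, 3.15, 3.15, 3.15, 3.72, 3.46)
)

cor_mat <- cor(chem)

# Keep each pair once (upper triangle) and only where |r| exceeds the threshold
threshold <- 0.6
idx <- which(abs(cor_mat) > threshold & upper.tri(cor_mat), arr.ind = TRUE)

strong_pairs <- data.frame(
  var1 = rownames(cor_mat)[idx[, "row"]],
  var2 = colnames(cor_mat)[idx[, "col"]],
  r    = cor_mat[idx]
)
print(strong_pairs)
```

Restricting the search to the upper triangle drops the diagonal (every variable correlates perfectly with itself) and avoids listing each pair twice.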
The plot below shows that the higher the citric acid, the higher the fixed acidity.



We see a similar positive correlation between density and fixed acidity.


The plot below shows that as pH gets higher, fixed acidity gets lower (negative correlation).

Final Plots and Summary (Interactive Plots!)

In the correlation section above, we drew box plots for all variables in the data frame to show their correlation with the rating. Below is the same chart for the correlation between volatile acidity and the wine rating.

We can summarize from the chart below that for wines rated 8, the volatile acidity range is smaller (judged by the size of the box and whiskers in the plot), and the majority of these wines have a volatile acidity between 0.33 and 0.49.
On the other hand, the wines rated 3 have a larger range of volatile acidity, and the majority fall between 0.61 and 1.02, which leads me to believe that lower volatile acidity leads to a higher rating.
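The box edges read off the chart can be checked numerically. The following is a minimal sketch using base R's `boxplot()` with `plot = FALSE`, which returns the box statistics without drawing anything. The volatile acidity values are copied from the rating-8 and rating-3 tables above. The data frame name `wines` is hypothetical, and base R's hinges may differ slightly from the quartiles an interactive plotting library reports.

```r
# Volatile acidity of the 18 wines rated 8 and the 10 wines rated 3 (from the tables above)
wines <- data.frame(
  volatile.acidity = c(0.35, 0.32, 0.85, 0.31, 0.62, 0.30, 0.35, 0.35, 0.42,
                       0.57, 0.40, 0.26, 0.54, 0.42, 0.49, 0.33, 0.38, 0.36,
                       0.58, 0.61, 1.185, 0.44, 1.02, 1.58, 0.815, 0.98, 0.875, 0.76),
  quality          = rep(c(8, 3), times = c(18, 10))
)

# Compute box plot statistics per rating without drawing the plot
bp <- boxplot(volatile.acidity ~ quality, data = wines, plot = FALSE)

# Rows 2 to 4 of bp$stats are the lower hinge, median and upper hinge of each box
box_edges <- data.frame(
  rating      = bp$names,
  lower_hinge = bp$stats[2, ],
  median      = bp$stats[3, ],
  upper_hinge = bp$stats[4, ]
)
print(box_edges)

# Points beyond 1.5 box lengths from the hinges are reported as outliers
print(bp$out)
```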



Similar to the plot above, the plot below shows the relationship between sulphates and rating. As we can see, the wines rated 8 have slightly higher sulphate levels than the wines rated 3. However, there are some outliers which might affect the correlation, so further analysis might be needed.
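One possible first step for that further analysis is to compare the Pearson correlation with the rank-based Spearman correlation, which is less sensitive to outliers, and to compare group medians rather than means. The sketch below is a suggestion under stated assumptions, not part of the original analysis. It uses only the sulphates values of the wines rated 8 and 3 listed earlier, in a data frame with the hypothetical name `wines`.

```r
# Sulphates of the 18 wines rated 8 and the 10 wines rated 3 (from the tables above)
wines <- data.frame(
  sulphates = c(0.86, 0.82, 0.82, 0.82, 0.69, 0.92, 0.65, 0.65, 0.74,
                0.74, 0.69, 0.63, 0.72, 0.69, 0.82, 1.10, 0.76, 0.70,
                0.57, 0.63, 0.54, 0.86, 0.49, 0.40, 0.51, 0.55, 0.52, 0.63),
  quality   = rep(c(8, 3), times = c(18, 10))
)

# Median sulphates per rating (the median is robust to a few extreme values)
medians <- tapply(wines$sulphates, wines$quality, median)
print(medians)

# Pearson uses the raw values; Spearman uses ranks and so down-weights outliers
pearson  <- cor(wines$sulphates, wines$quality, method = "pearson")
spearman <- cor(wines$sulphates, wines$quality, method = "spearman")
print(round(c(pearson = pearson, spearman = spearman), 2))
```

If the two coefficients disagree noticeably on the full data set, that is a hint that a few extreme wines are driving the Pearson value.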



The final plot we are going to look at from the correlation section is the alcohol-to-rating plot. This plot shows a clear relationship between higher alcohol percentage and higher rating.
* The majority of the wines rated 8 have an alcohol level ranging from 11.30% to 12.90%
* The majority of the wines rated 3 have an alcohol level ranging from 9.70% to 10.70%

Reflection

The data set contains the ratings and chemical properties of 1,599 wines tested by experts and rated on a scale of 0 to 10 (very bad to excellent), although the data set we have only contains ratings from 3 to 8. I started by examining the data set to learn what the variables are and the data type of each, then moved on to visualizing the ratings and the correlations between the variables.

Some variables have some correlation with the rating given to the wine. However, I don’t think the data is strong enough to suggest that a wine is rated higher due to a specific chemical property. I believe it would help to know the circumstances of the judges during the rating process. For example, what kind of food had they eaten on the day of the rating? Did the judges rate multiple wines on the same day? Answers to these questions and more could help us understand the data better and reach better conclusions.

A limitation I faced is not knowing how the ratings were given and over what time period.

Sources